ABSTRACT

We propose a classification technique for face expression
recognition using AdaBoost that learns by selecting the relevant
global and local appearance features with the most discriminating
information. Selectivity reduces the dimensionality of the feature
space, which in turn results in a significant speedup during online
classification. We compare our method with another leading
margin-based classifier, the Support Vector Machine (SVM), and
identify the advantages of using AdaBoost over SVM in this context.
We use histograms of Gabor and Gaussian derivative responses as the
appearance features. We apply our approach to the face expression
recognition problem, where local appearances play an important role.
Finally, we show that although SVM performs equally well, AdaBoost
feature selection provides a final hypothesis model that can easily be
visualized and interpreted, which is lacking in the high-dimensional
support vectors of the SVM.
1  Introduction

Current classification techniques define image similarities at
varying levels of detail. Those that use only global features such as
color [1] and texture histograms [2] tend to be expensive in both
memory and computation because of the dimensionality of the features,
even though such features can lend more flexibility and information
to the classification task. Applying Principal Component Analysis
(PCA) can reduce the dimensionality without sacrificing performance
[3]. Even then, one cannot avoid a high-dimensional feature space
altogether, since the principal components in PCA are still linear
combinations of the original features. Tieu and Viola [4] present a
boosting approach that selects a small number of features from a very
large set, allowing fast and effective online classification. Their
method applies hand-crafted primitive kernels recursively on
subsampled images, resulting in a causal structure that is used to
discriminate image classes. These selected features, however, are
only meaningful in the feature space, and it is difficult to
interpret their structure intuitively.


Personal experience and psychophysical studies of saccadic eye
movements [5] indicate that local appearances play a crucial role in
learning a good classification. More often than not, people can
recognize objects because they seek particular regions where
discriminating information is located. For example, to classify a car
by make or model, the focus of attention is on small regions at the
front or back of the car where the name or symbol is printed. The
shapes of the head and rear lights are also significant. By the same
token, other regions such as windshields or tires do not carry much
information. Modeling this finding, a classification mechanism should
be able to discard most of the irrelevant image regions without
sacrificing performance.

Techniques that depend only on local regions [6, 7] capitalize on
this insight by attempting to segment an image into blobs and
focusing only on the similarities between blobs using colors or
textures. Minut et al. [5] use eye-saccade data to generate
sequential observations that are then used to learn a Hidden Markov
Model (HMM) for each face in the database. The observed locations
turn out to be more important than the observation sequence, though
the latter fits nicely into an HMM framework. Jaimes et al. [8, 9]
hypothesize a strong correlation between eye movements and different
semantic categories of images and use this hypothesis to build
automatic content-based classifiers.

Following this motivation to look for locally discriminative
appearances, we propose to identify, from a high-dimensional feature
space, only those dimensions that carry the most information. In this
paper, we present an approach using AdaBoost to select features as
part of the training phase itself, thereby making the feature
extraction process in the testing phase very efficient. Our approach
differs from previous work in that the reduced set of features
contains both locally and globally informative features. Our system
automatically singles out the discriminative features, and
consequently the discriminative image regions, without relying on a
priori domain knowledge. Finally, by a novel combination of feature
extraction and feature-selective classification techniques, we show
that an overlay of these region selections on an original image
enables us to visualize the actual image regions that carry the
information crucial for the classification task. Thus, the proposed
technique not only significantly reduces the problem complexity and
speeds up the online process, but also provides a meaningful
interpretation of the image regions used in the classification
process.

In the following section, we define a composite feature that is a
concatenation of global and local appearance features extracted from
uniformly partitioned regions of an image. The appearance features
are derived from either Gabor wavelets or Gaussian derivative filters
applied to each partition. We present experimental results on the
performance of our approach on the problem of recognizing facial
expressions such as smile and scream. This problem is particularly
suited to our approach because local regions of the image are
sufficient to determine an expression category. Finally, we compare
the results of this approach to classification using a standard
technique, the SVM, and point out the essential differences.

2  Features 

2.1  Multi-scale  Gaussian  derivative  features 

Let $I_x$ be the first-order partial derivative with respect to $x$
of image $I$. $I_y$ is defined similarly, and $I_{xx}$, $I_{xy}$,
$I_{yy}$ denote the second derivatives. These derivatives are more
stable when computed by filtering the image with the corresponding
normalized Gaussian derivative than by the regular finite differences
method [2]. A 2D Gaussian function with zero mean and standard
deviation $\sigma$ is defined as

$$g_1(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (1)$$

The parameter $\sigma$ is also referred to as the scale of the
Gaussian function. Local intensity surface orientation and curvature
features can be defined as functions of the partial derivatives:
The orientation and the shape index computed at every pixel are
discretized, and histograms representing the distributions of these
features are constructed. These features are good at modeling
dominant local gradients as well as curvatures [10]. Let the feature
vector be represented as

$$\phi_\sigma = [C_\sigma; O_\sigma]_{1 \times 2b} \qquad (3)$$

where $\sigma$ is the scale factor, $b$ is the number of histogram
bins, and $C_\sigma$ and $O_\sigma$ are, respectively, the shape index
and the orientation histograms. Additionally, instead of the
concatenation, either $C_\sigma$ or $O_\sigma$ could be used as a
stand-alone feature. An example of this feature is shown in Figure 1.
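As an illustration of how such a feature might be computed, the following minimal sketch builds the two histograms with NumPy. The definitions of the orientation and the shape index do not appear in the text above, so the sketch assumes common ones: the gradient direction `arctan2(Iy, Ix)` and a Koenderink-style shape index computed from the eigenvalues of the Hessian. The helper names, the kernel radius of three standard deviations, and the random test image are likewise illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernels(sigma):
    """Sampled 1D Gaussian and its first and second derivatives."""
    radius = int(np.ceil(3 * sigma))         # assumed truncation radius
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()                             # normalized smoothing kernel
    g1 = -t / sigma**2 * g                   # first derivative of the Gaussian
    g2 = (t**2 - sigma**2) / sigma**4 * g    # second derivative of the Gaussian
    return g, g1, g2

def separable_filter(image, k_y, k_x):
    """Convolve along y (axis 0) with k_y, then along x (axis 1) with k_x.

    Assumes the image is larger than the kernels in both directions.
    """
    tmp = np.apply_along_axis(np.convolve, 0, image, k_y, mode="same")
    return np.apply_along_axis(np.convolve, 1, tmp, k_x, mode="same")

def orientation_shape_histograms(image, sigma, bins=64):
    """Return the concatenation [C; O] of shape-index and orientation histograms."""
    g, g1, g2 = gaussian_kernels(sigma)
    Ix = separable_filter(image, g, g1)
    Iy = separable_filter(image, g1, g)
    Ixx = separable_filter(image, g, g2)
    Iyy = separable_filter(image, g2, g)
    Ixy = separable_filter(image, g1, g1)

    # Assumed orientation: gradient direction in [-pi, pi].
    theta = np.arctan2(Iy, Ix)

    # Assumed shape index: from the principal curvatures k1 >= k2
    # (eigenvalues of the Hessian), mapped to [-1, 1].
    half_trace = (Ixx + Iyy) / 2
    root = np.sqrt(((Ixx - Iyy) / 2) ** 2 + Ixy**2)
    k1, k2 = half_trace + root, half_trace - root
    shape = np.clip((2 / np.pi) * np.arctan2(k1 + k2, k1 - k2), -1.0, 1.0)

    C, _ = np.histogram(shape, bins=bins, range=(-1.0, 1.0))
    O, _ = np.histogram(theta, bins=bins, range=(-np.pi, np.pi))
    return np.concatenate([C, O]).astype(float)

rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in for a cropped face image
feature = orientation_shape_histograms(image, sigma=2.0, bins=64)
print(feature.shape)                         # one vector of length 2b
```

With `bins=64` the result has length 2b = 128, matching the 1 x 2b layout of the concatenated feature; either half can be used alone.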


2.2  Gabor  wavelet  features 

The Gabor wavelet filter [11] is defined as

$$g_2(x, y, \sigma, u_0, v_0) = g_1(x, y, \sigma) \cdot e^{j(u_0 x + v_0 y)} \qquad (4)$$

where $(u_0, v_0)$ is the center frequency of the filter. Let the
Gabor filter response at scale $\sigma$ and location $(x, y)$ on an
image be represented by $\psi_\sigma(x, y)$. These filter responses
are discretized into $b$ intervals that define the $b$ bins of the
histogram. Let $b^k_{\min}$ and $b^k_{\max}$, respectively, be the
lower and upper limits of the $k$th bin. Then the value at each bin
$k$, represented by $\phi^k_\sigma$, is equal to

$$\phi^k_\sigma = \sum_{b^k_{\min} \le \psi_\sigma(x, y) < b^k_{\max}} \psi_\sigma(x, y) \qquad (5)$$

The feature vector $\phi_\sigma$ at scale $\sigma$ is then defined as

$$\phi_\sigma = [\phi^1_\sigma, \phi^2_\sigma, \ldots, \phi^b_\sigma]_{1 \times b} \qquad (6)$$

An example of this feature, $\phi_\sigma$, is shown in Figure 1. This
feature differs from a regular histogram because it accumulates into
each bin the actual filter responses instead of the pixel counts. In
a regular histogram of the responses, the bins whose ranges are near
zero tend to dominate the histogram when significant areas of the
image are textureless. This is not desirable, and therefore these
responses need to be weighted less. This happens naturally in our
approach because the actual filter responses are accumulated.
Further, it is desirable to pay more attention to the filter
responses with extreme values, as they indicate a better (or worse)
alignment between local image structures and the filter patterns.
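The difference between a count histogram and the response-accumulating histogram described in this section can be seen in a small sketch. The function name, the four uniform bins, and the synthetic responses (many near-zero values standing in for textureless areas, plus a few strong ones) are assumptions made purely for illustration; treating each bin as a half-open interval is also an assumption.

```python
import numpy as np

def response_histogram(responses, edges):
    """Sum the filter responses that fall into each bin, instead of counting them."""
    r = np.ravel(responses)
    # Bin k covers [edges[k], edges[k + 1]); values outside the edges are ignored here.
    idx = np.digitize(r, edges) - 1
    inside = (idx >= 0) & (idx < len(edges) - 1)
    return np.bincount(idx[inside], weights=r[inside], minlength=len(edges) - 1)

# 100 near-zero responses (textureless area) and three strong responses.
responses = np.array([0.01] * 100 + [0.9, 0.8, -0.7])
edges = np.linspace(-1.0, 1.0, 5)            # bins: [-1,-.5) [-.5,0) [0,.5) [.5,1)

counts, _ = np.histogram(responses, bins=edges)
weighted = response_histogram(responses, edges)

print(counts)     # the near-zero bin dominates the pixel counts
print(weighted)   # the strong responses dominate the accumulated values
```

In the count histogram the bin around zero holds 100 of the 103 samples, whereas in the accumulated version its value is only about 1.0 and the bin holding the two strong positive responses is the largest.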

2.3  The  Composite  features 

Both the Gaussian derivative filters and the Gabor wavelets can be
viewed as bandpass filters whose bandwidths depend on $\sigma$, their
scale parameter. Consequently, filtering at multiple scales is
essential in order to extract a wide range of frequencies.

Local features extracted from different image regions may be
noticeably different. This distinction is lost if only a global
histogram is constructed. We preserve both global and local features
by partitioning an image into $P \times Q$ distinct blocks. The
number of partitions is determined by the scale at which the features
are extracted. Specifically, at the coarsest scale, with a relatively
high $\sigma$, a $1 \times 1$ partition is sufficient because the
coarse filter, having a large footprint, looks for global structures.
At finer scales, which have smaller filtering footprints, more
partitions are needed.

The composite feature vector at a scale $\sigma$ whose corresponding
partition is $P \times Q$ is

$$\Phi_\sigma = [\phi_{\sigma,1}; \phi_{\sigma,2}; \ldots; \phi_{\sigma,P \cdot Q}] \qquad (7)$$

where $\phi_{\sigma,p}$ is the appearance feature vector at scale
$\sigma$ extracted from the $p$th subimage of the partition
$P \times Q$. The concatenation over $M$ scales gives the final
composite feature vector:

$$\Phi = [\Phi_{\sigma_1}; \Phi_{\sigma_2}; \ldots; \Phi_{\sigma_M}] \qquad (8)$$
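The two levels of concatenation can be sketched as follows. The partition list is the one reported in Section 4.1; everything else is a placeholder assumed for illustration: the random arrays stand in for per-scale filter responses, and `block_feature` is a plain 64-bin count histogram rather than either of the appearance features defined above.

```python
import numpy as np

def partition_blocks(image, P, Q):
    """Split an image into P x Q non-overlapping blocks, listed in row-major order."""
    row_groups = np.array_split(np.arange(image.shape[0]), P)
    col_groups = np.array_split(np.arange(image.shape[1]), Q)
    return [image[np.ix_(r, c)] for r in row_groups for c in col_groups]

def composite_feature(responses_per_scale, partitions, block_feature):
    """Concatenate block features over all blocks of a scale, then over all scales."""
    parts = []
    for response, (P, Q) in zip(responses_per_scale, partitions):
        for block in partition_blocks(response, P, Q):
            parts.append(block_feature(block))
    return np.concatenate(parts)

# Partitions for the six scales, as listed in Section 4.1.
partitions = [(1, 1), (3, 3), (3, 3), (5, 5), (5, 5), (7, 7)]

# Placeholder inputs: one random "response image" per scale.
rng = np.random.default_rng(0)
responses_per_scale = [rng.random((70, 70)) for _ in partitions]

def block_feature(block, bins=64):
    """Placeholder per-block feature: a 64-bin count histogram."""
    return np.histogram(block, bins=bins, range=(0.0, 1.0))[0].astype(float)

composite = composite_feature(responses_per_scale, partitions, block_feature)
print(composite.shape)
```

Keeping the per-block vectors in a fixed order is what later allows a selected feature to be traced back to a scale and an image region.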


3  Classifiers 

We  describe  two  margin-based  classification  paradigms 
used  in  our  experiments:  the  feature  selective  AdaBoost 
classifier  and  the  SVM-based  classification  approach. 

In a two-class classification problem, let a set of $n$ data points
in an $N$-dimensional feature space be

$$\{(\Phi_i, y_i)\}_{i=1}^{n}, \quad \Phi_i \in \mathbb{R}^N, \; y_i \in \{\pm 1\} \qquad (9)$$

A pair $(\Phi_i, y_i)$ is called a positive instance if $y_i = 1$,
and a negative instance otherwise. A classifier seeks a decision
function, or hypothesis, $H: \mathbb{R}^N \to \{\pm 1\}$, that
minimizes some loss function.

3.1  AdaBoost 

Recall that each image $i \in \{1, 2, \ldots, n\}$ is represented by
$\Phi_i$, a concatenation of global and local appearance features
$\phi_{\sigma,p,i}$ extracted at different scales from different
subregions. From this composite feature, AdaBoost [12] learns the
classification by selecting only those individual features that can
best discriminate among classes. We achieve this by designing our
weak learner as suggested by Howe [13].

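Training on an individual appearance feature $\phi_{\sigma,p}$, a
decision boundary hyperplane $k$ is a bisection between the weighted
mean vectors of the positive and negative sample sets:

$$k = \frac{\sum_{\forall i \in \Phi^p} D(i) \cdot \phi_{\sigma,p,i}}{\left\| \sum_{\forall i \in \Phi^p} D(i) \cdot \phi_{\sigma,p,i} \right\|} - \frac{\sum_{\forall i \in \Phi^n} D(i) \cdot \phi_{\sigma,p,i}}{\left\| \sum_{\forall i \in \Phi^n} D(i) \cdot \phi_{\sigma,p,i} \right\|} \qquad (10)$$

where $\Phi^p = \{j \mid y_j = 1\}$ and $\Phi^n = \{j \mid y_j = -1\}$,
and each sample is weighted by the distribution $D$.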

The positive half-space of $k$ usually contains a majority of the
positive instances. Therefore, if a sample belongs to this
half-space, it is classified as positive, and as negative otherwise.
The decision is flipped when only a minority of the positive
instances fall into the positive half-space. Symbolically, this weak
hypothesis is a function
$h = h_{\sigma,p}: \phi_{\sigma,p} \to \{\pm 1\}$ whose empirical
error is

$$\epsilon = \sum_{\forall i} D(i) \cdot h_{\sigma,p}(\phi_{\sigma,p,i}) \cdot y_i \qquad (11)$$

At each step, the appearance features, each parametrized by its scale
and its partition, together form a family of hypotheses
$\{h_{\sigma,p}\}$. AdaBoost then chooses the hypothesis that carries
the minimum error. Effectively, this means that each AdaBoost
iteration picks the hypothesis, and in turn the individual feature
vector, that contains the most discriminating information, allowing a
correction of the classification errors resulting from previous
steps. The feature selective AdaBoost [12] is outlined below.

•  Given a training set containing positive and negative
samples, where each sample $i$ is $(\Phi_i, y_i)$; $\Phi_i$ is the
composite feature vector of sample $i$, and $y_i \in \{\pm 1\}$ is
the corresponding class label. Initialize the sample distribution
$D_0$ by weighting every training sample equally.


•  For $T$ iterations do

-  Train a hypothesis for each feature $\phi_{\sigma,p}$.

-  Choose the hypothesis $h_t$ with minimum classification
error $\epsilon_t$ on the weighted samples.

-  Compute $\alpha_t$, which weights $h_t$ by its classification
performance.

-  Update and normalize the weighted distribution:

$D_{t+1}(i) \propto D_t(i) \cdot e^{-\alpha_t y_i h_t(\Phi_i)}$.

•  The final hypothesis
$H(\Phi) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\Phi)\right)$
is a linear combination of $T$ hypotheses that are functions of
selected features.
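The outline above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the weak learner reads Equation (10) as the difference of the two normalized weighted sums with a hyperplane through the origin, the error is taken to be the weighted misclassification rate, the decision is flipped when that error exceeds 0.5, and the weight uses the standard AdaBoost choice 0.5 ln((1 - e) / e), since the weight formula is not legible in the text. The function names and the toy data (five feature blocks, only block 2 informative) are hypothetical.

```python
import numpy as np

def train_weak(X, y, D):
    """Weak learner on one appearance feature X (n_samples x dim)."""
    s_pos = (D[y == 1][:, None] * X[y == 1]).sum(axis=0)
    s_neg = (D[y == -1][:, None] * X[y == -1]).sum(axis=0)
    k = s_pos / np.linalg.norm(s_pos) - s_neg / np.linalg.norm(s_neg)
    pred = np.where(X @ k >= 0, 1, -1)
    err = D[pred != y].sum()          # assumed: weighted misclassification rate
    flip = 1
    if err > 0.5:                     # assumed flipping rule
        flip, err = -1, 1.0 - err
    return k, flip, err

def adaboost_select(blocks, y, T):
    """blocks: list of (n_samples x dim) arrays, one per appearance feature."""
    n = len(y)
    D = np.full(n, 1.0 / n)           # uniform initial distribution
    ensemble = []
    for _ in range(T):
        trained = [train_weak(X, y, D) for X in blocks]
        j = min(range(len(blocks)), key=lambda idx: trained[idx][2])
        k, flip, err = trained[j]
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)      # assumed standard AdaBoost weight
        h = flip * np.where(blocks[j] @ k >= 0, 1, -1)
        D = D * np.exp(-alpha * y * h)             # emphasize misclassified samples
        D /= D.sum()
        ensemble.append((j, alpha, k, flip))
    return ensemble

def predict(ensemble, blocks):
    """Sign of the alpha-weighted vote of the selected weak hypotheses."""
    score = np.zeros(len(blocks[0]))
    for j, alpha, k, flip in ensemble:
        score += alpha * flip * np.where(blocks[j] @ k >= 0, 1, -1)
    return np.where(score >= 0, 1, -1)

# Toy data: five 4-dimensional features, only feature 2 depends on the label.
rng = np.random.default_rng(0)
y = np.repeat([1, -1], 100)
blocks = [rng.normal(size=(200, 4)) for _ in range(5)]
blocks[2] = blocks[2] + 3.0 * y[:, None]

ensemble = adaboost_select(blocks, y, T=5)
selected = [j for j, *_ in ensemble]
accuracy = float((predict(ensemble, blocks) == y).mean())
print(selected, accuracy)
```

At test time only the blocks whose indices appear in `selected` need to be extracted, which is where the online saving comes from.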

3.2  Support  Vector  Machines 

Unlike traditional classification techniques that aim at minimizing
the Empirical Risk, SVM approaches the classification problem as an
approximate implementation of the Structural Risk Minimization (SRM)
induction principle, which is a reduced form of an Expected Risk
minimization problem [14, 15]. To this end, an upper bound on the
generalization error of the model is minimized, and the decision
surface is placed so that the margin between the classes, which is
the distance from the separating hyperplane to the closest positive
or negative sample, is maximized.

SVM approximates the solution to the minimization problem of SRM
through a Quadratic Programming optimization. As a result, a subset
of the training samples is chosen as support vectors that determine
the decision boundary hyperplane of the classifier.

Though  in  principle  the  hyperplanes  can  only  learn 
linearly  separable  datasets,  in  practice,  nonlinearity  is 
achieved  by  applying  an  SVM  kernel  that  maps  an  input 
vector  onto  a  higher  dimensional  feature  space  implicitly. 

In  this  paper,  we  use  SVM  [16]  with  the  radial  basis 
function  kernel  as  a  black  box  classifier  over  the  labeled 
set  of  composite  feature  vectors.  Doing  so  yields  support 
vectors  in  the  composite  feature  space. 

3.3  Multi-class  classifiers 

Both AdaBoost and SVM, as explained above, are suitable only for
binary classification. However, they can be easily extended to a
multi-class problem by utilizing Error Correcting Output Codes (ECOC)
[17].

A dichotomy is a two-class classifier that learns from data labeled
as positive (+), negative (-), or don't care (0). Given any number of
classes, we can relabel them with these three symbols and thus form a
dichotomy. Different relabelings result in different two-class
problems, each of which is learned independently. A multi-class
classifier progresses through every selected dichotomy and chooses
the class that is correctly classified by the maximum number of
selected dichotomies.

Exhaustive dichotomies represent the set of all possible ways of
dividing and relabeling the dataset with the three defined symbols. A
one-against-all classification scheme on an n-class problem considers
n dichotomies, each of which relabels one class as (+) and all other
classes as (-).
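The decoding rule can be illustrated with a small pure-Python sketch. The function names and the example classifier outputs are hypothetical; the class order (neutral, smile, scream) and the six code words are those listed in the caption of Figure 6. A class receives a vote from a dichotomy when the dichotomy's output equals the label that the code word assigns to that class, and don't care entries never vote.

```python
def one_against_all(n_classes):
    """Code words in which dichotomy k labels class k as +1 and all others as -1."""
    return [[1 if c == k else -1 for c in range(n_classes)]
            for k in range(n_classes)]

def decode(code_words, outputs):
    """Pick the class that agrees with the largest number of dichotomy outputs."""
    n_classes = len(code_words[0])
    votes = [0] * n_classes
    for word, out in zip(code_words, outputs):
        for c, label in enumerate(word):
            if label != 0 and label == out:   # 0 is a don't care label
                votes[c] += 1
    best = max(range(n_classes), key=votes.__getitem__)
    return best, votes

classes = ["neutral", "smile", "scream"]

# One-against-all: the three binary classifiers answer -1, +1, -1.
oaa_class, oaa_votes = decode(one_against_all(3), [-1, 1, -1])

# Code words from the Figure 6 caption, with hypothetical outputs for one sample.
figure6_words = [[1, -1, -1], [1, 0, -1], [-1, 1, -1],
                 [0, 1, -1], [1, 1, -1], [1, -1, 0]]
ex_class, ex_votes = decode(figure6_words, [-1, -1, -1, -1, -1, 1])

print(classes[oaa_class], oaa_votes)
print(classes[ex_class], ex_votes)
```

Ties are broken here in favor of the first class, which is an arbitrary choice of this sketch.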

4  Face  Expression  Recognition 

We applied our integrated feature selection and classification
approach to the problem of identifying expressions on faces. The
features described in Section 2 are considered appropriate because
facial expressions have characteristic local structures that can be
mathematically described by edge orientations and curvatures
(functions of Gaussian derivatives) or, more generally, spatial
frequencies (Gabor wavelets). Although we look at a person's face as
a whole, we focus our attention only on small regions at any instant
in time because expressions are mostly localized to regions near the
eyes and the mouth. A smile is mostly shown by a person's mouth,
while anger is partly shown by a person's eyes. Cheeks and noses
contain much less significant information. Since our approach is well
suited to single out discriminative features both at the global level
and at multiple local levels, it is ideal for this problem domain.

4.1  Implementation 

We conducted our experiments on the AR face database [18]. We chose
face images of 120 people: 55 women and 65 men. Each person shows
four expressions: neutral, smile, anger, and scream. There are two
images of each person's expression, taken in two different sessions.
Thus, in all, we have a total of 960 facial images with 240 images
for each expression. We manually cropped every face image to remove
the influence of the background. This is not strictly necessary for
our method if all the subjects are located at roughly the same region
in every image. An example face with four different expressions is
shown in the first column of Figure 1.

Both the Gaussian and the Gabor features were extracted at 6 scales,
with corresponding partitions of 1×1, 3×3, 3×3, 5×5, 5×5, and 7×7.
The first scale corresponds to the lowest frequency with
$\sigma_1 = 30$ pixels per cycle.
Subsequent  frequencies  are  determined  by  a,=  ^'--\)  for 

$i = \{2, 3, \ldots, 6\}$. The choice for the first scale, which is
intentionally used with a non-partitioned image, is guided by the
mean size of the cropped faces in the database, so that the image
covers roughly two cycles of the defined filter. On the same basis,
higher frequencies determine finer image partitions so that the
subimages are roughly two cycles wide.

The number of histogram bins chosen is 64. While the Gaussian
derivative responses are uniformly discretized, the bins for the
Gabor responses are defined nonlinearly: the values ranging from
-0.04 to 0.04 are divided equally into 62 intervals making up the
middle 62 bins, the first bin covers all values less than -0.04, and
values greater than 0.04 fall into the last bin. Examples of the
extracted features are shown in Figure 1. The composite feature is
very high dimensional. For example, with our configuration, the
dimension of the Gaussian composite feature vector is 15,104.

Figure 1. An example face showing four expressions and corresponding
features at the coarsest scale. The first column shows the
expressions. The normalized shape index and the orientation
histograms are shown in the second and third columns, respectively.
The last column shows the distribution of Gabor filter responses.

Figure 2. Performance on the test data (anger included)
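The quoted dimensionality of the Gaussian composite feature can be checked from the partitions and the bin count given in this section, assuming each block contributes one shape index histogram and one orientation histogram. The Gabor figure printed by the sketch is not stated in the text; it follows from the same arithmetic with a single histogram per block.

```python
# Partitions at the six scales and the number of histogram bins (Section 4.1).
partitions = [(1, 1), (3, 3), (3, 3), (5, 5), (5, 5), (7, 7)]
bins = 64

# One appearance feature per block and per scale.
n_features = sum(P * Q for P, Q in partitions)

# Gaussian feature: shape index and orientation histograms, 2 * bins per block.
gaussian_dim = n_features * 2 * bins

# Gabor feature: a single response histogram per block (derived, not quoted).
gabor_dim = n_features * bins

print(n_features, gaussian_dim, gabor_dim)
```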

The number of iterations for AdaBoost is set at 50. Both SVM and
AdaBoost performed multi-class classification by employing
one-against-all dichotomies. For AdaBoost, we also tested on an
exhaustive set of dichotomies. All statistical results of our
experiments are based on a 5-fold cross validation analysis [19],
where the classifiers were trained on 80% of the data and tested on
the other 20% in 5 runs, each run retaining a different subset (20%
in each case) of the data as the test set.

4.2  Results  and  Analysis 

The results from two sets of experiments, the first including the
anger expression and the second excluding it, are shown in Figures 2
and 3, respectively. In both cases, the best average performance was
obtained using the orientation feature with the AdaBoost classifier:
79.27% and 94.86%, respectively. The reason for the apparently poor
performance in the first set of experiments is that neutral and
anger, as expressed by people in the database, are not visibly
different in most cases (see Figure 5). This is further substantiated
by examining the performance of the two-class classifiers,
illustrated in Figure 4. The classifier clearly has a hard time
discriminating between the neutral and anger classes, although it
performs very well in every other case.

Figure 3. Performance on the test data (anger excluded)

Figure 4. Two-class classification performance of AdaBoost on the
test data. On the x-axis, neutral, smile, anger, and scream are
denoted by N, S, A, and Sc, respectively.

Figure 5. Examples showing the similarity of neutral (top row) and
anger (bottom row) expressions in the database.

Figure 6. Feature selections by AdaBoost. The dichotomies from top to
bottom and left to right are [1 -1 -1], [1 0 -1], [-1 1 -1],
[0 1 -1], [1 1 -1], [1 -1 0]. Darker regions represent a low
accumulation of $\alpha$ values and hence a non-discriminative
nature. Brighter regions are highly discriminative for their
dichotomies.

Experimental results show that the performances of SVM and AdaBoost
are comparable. They performed almost equally well, with a slight
preference toward AdaBoost when an exhaustive set of dichotomies is
employed. On average, SVM tends to have higher variances. Another
drawback of SVM is its dependency on parameter settings; the choices
of the kernel function and its parameters are crucial.

It is important to note that while AdaBoost feature selection
provides a final hypothesis model that can be easily interpreted, the
high-dimensional support vectors of the SVM approach do not provide
one.


Figure 6 illustrates the image regions where AdaBoost picked out
discriminating features. Each image is the result for a particular
dichotomy represented by a code word, where -1 and +1 indicate
negative and positive samples, and 0 is a don't care label. In a code
word, the first number represents the neutral expression, and the
second and third represent smile and scream. The images show the
accumulated $\alpha$ values over all iterations. Image regions with
higher values contribute more to the final hypothesis. Intuitively,
the higher the value a region carries, the more influence it has on
the final classification decision, and consequently the more
information relevant to the classification task it contains.
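A map of this kind could be assembled as in the following sketch, which adds each selected hypothesis's weight to the image block its feature came from. The function name, the (scale index, block index, weight) triples, the row-major block numbering, and the 105 x 105 image size are all hypothetical, and letting the global 1 x 1 feature add its weight to the whole image is an assumption; the text does not describe how the figure was rendered.

```python
import numpy as np

def alpha_map(selections, partitions, image_shape):
    """Accumulate the weight of each selected feature over its image block.

    selections: iterable of (scale_index, block_index, alpha) triples,
    with blocks numbered in row-major order within their partition.
    """
    heat = np.zeros(image_shape)
    rows, cols = image_shape
    for s, p, alpha in selections:
        P, Q = partitions[s]
        r, c = divmod(p, Q)
        r_edges = np.round(np.linspace(0, rows, P + 1)).astype(int)
        c_edges = np.round(np.linspace(0, cols, Q + 1)).astype(int)
        heat[r_edges[r]:r_edges[r + 1], c_edges[c]:c_edges[c + 1]] += alpha
    return heat

partitions = [(1, 1), (3, 3), (3, 3), (5, 5), (5, 5), (7, 7)]

# Hypothetical selections: a global feature, a coarse bottom-middle block,
# and a fine block inside the same area.
selections = [(0, 0, 0.2), (1, 7, 1.0), (5, 45, 0.8)]
heat = alpha_map(selections, partitions, (105, 105))

print(heat[100, 50], heat[0, 0])   # overlapping blocks add up; corners get little
```

Overlaying `heat` on a face image, with brighter meaning larger accumulated weight, gives the kind of visualization described above.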

As reflected in the results, AdaBoost successfully picked the mouth
and the eyes as being most informative and discarded other regions as
being irrelevant. This is expected because a person's mouth and eyes
look different when expressing neutral, smile, or scream. Clearly,
the appearance of the mouth region contains significant information.
Also, in this dataset, people scream with their eyes closed (see
Figure 1), which results in the contribution from the eye regions.
Additionally, this result resembles how humans naturally perceive and
recognize facial expressions.

Figure 7. Convergence of feature selective AdaBoost classifiers on
the six dichotomies defined in Figure 6.


When AdaBoost with feature selection is employed, the memory
requirements and the computational complexity are significantly
reduced. During online classification, only those features chosen by
AdaBoost are needed instead of the composite feature as a whole.
Specifically, let $S$ be the total number of appearance features
defined and $T$ be the number of iterations run by AdaBoost; then a
compression ratio $r$ of $S:T$ is achieved. To be concrete, in our
experiment, $r = 118:50$, which is a more than twofold reduction.
Keeping in mind that the actual dimension of the feature vector is
$S \times b$, where $b$ is the number of histogram bins (64 in our
case), the reduction amounts to $(S - T) \times b$, that is, several
thousand dimensions. Furthermore, the value of $T$ can be set much
lower than 50, as the plot in Figure 7 shows that AdaBoost converges
quickly in most cases, i.e., within 25 iterations. This further
doubles the compression ratio.
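The figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch takes S = 118 and b = 64 from the text and treats T x b as an upper bound on the number of histogram dimensions needed online, on the assumption that AdaBoost may select the same feature in more than one iteration.

```python
S, b = 118, 64                       # appearance features and histogram bins

full_dim = S * b                     # dimensions of the full composite feature
ratios = {T: S / T for T in (50, 25)}
selected_dims = {T: T * b for T in (50, 25)}   # upper bound on dimensions kept

for T in (50, 25):
    print(T, round(ratios[T], 2), selected_dims[T], full_dim - selected_dims[T])
```

For T = 50 the ratio is 2.36 and at most 3,200 of the 7,552 dimensions are needed; halving T to 25 doubles the ratio to 4.72.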

5  Conclusions  and  Future  Works 

We have proposed a classification system capable of capturing both
global and local features and, at the same time, identifying the
image regions where distinctive information is located. We
successfully applied this technique to the face expression
recognition problem.

The main benefits gained from this new feature extraction and image
classification approach are the meaningful visualization of
informative image regions and the reduction of computational
complexity without applying any domain knowledge. The system
automatically learns from training data where to look for
discriminating information. The reduced feature set then enables fast
online classification.

This technique can be applied to a broad range of recognition and
classification applications as long as the objects to be classified
are at a similar location, orientation, and scale in both the
training and the testing images. However, if the system is used in
conjunction with appropriate segmentation and rectification
algorithms, then these constraints can be removed.